Loan Data Exploration

by: Seni Kamara

Preliminary Wrangling

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others (Udacity, DAND, 2019).

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [2]:
# Read data from file into DF ld
ld_c = pd.read_csv('/Users/SeniKamara/BESTPICK/DATA_SCIENCE/DAND/PROJECT_5/PROJECT_5.5/PROJECT_SUBMISSION/prosperLoanData.csv')
ld_c.head()
Out[2]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [3]:
# First we make copy of the original DF
ld = ld_c.copy()
In [4]:
# Looking at the overall shape of the DF
ld.shape
Out[4]:
(113937, 81)
In [5]:
# Looking at the structure of the dataset
ld.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
ListingKey                             113937 non-null object
ListingNumber                          113937 non-null int64
ListingCreationDate                    113937 non-null object
CreditGrade                            28953 non-null object
Term                                   113937 non-null int64
LoanStatus                             113937 non-null object
ClosedDate                             55089 non-null object
BorrowerAPR                            113912 non-null float64
BorrowerRate                           113937 non-null float64
LenderYield                            113937 non-null float64
EstimatedEffectiveYield                84853 non-null float64
EstimatedLoss                          84853 non-null float64
EstimatedReturn                        84853 non-null float64
ProsperRating (numeric)                84853 non-null float64
ProsperRating (Alpha)                  84853 non-null object
ProsperScore                           84853 non-null float64
ListingCategory (numeric)              113937 non-null int64
BorrowerState                          108422 non-null object
Occupation                             110349 non-null object
EmploymentStatus                       111682 non-null object
EmploymentStatusDuration               106312 non-null float64
IsBorrowerHomeowner                    113937 non-null bool
CurrentlyInGroup                       113937 non-null bool
GroupKey                               13341 non-null object
DateCreditPulled                       113937 non-null object
CreditScoreRangeLower                  113346 non-null float64
CreditScoreRangeUpper                  113346 non-null float64
FirstRecordedCreditLine                113240 non-null object
CurrentCreditLines                     106333 non-null float64
OpenCreditLines                        106333 non-null float64
TotalCreditLinespast7years             113240 non-null float64
OpenRevolvingAccounts                  113937 non-null int64
OpenRevolvingMonthlyPayment            113937 non-null float64
InquiriesLast6Months                   113240 non-null float64
TotalInquiries                         112778 non-null float64
CurrentDelinquencies                   113240 non-null float64
AmountDelinquent                       106315 non-null float64
DelinquenciesLast7Years                112947 non-null float64
PublicRecordsLast10Years               113240 non-null float64
PublicRecordsLast12Months              106333 non-null float64
RevolvingCreditBalance                 106333 non-null float64
BankcardUtilization                    106333 non-null float64
AvailableBankcardCredit                106393 non-null float64
TotalTrades                            106393 non-null float64
TradesNeverDelinquent (percentage)     106393 non-null float64
TradesOpenedLast6Months                106393 non-null float64
DebtToIncomeRatio                      105383 non-null float64
IncomeRange                            113937 non-null object
IncomeVerifiable                       113937 non-null bool
StatedMonthlyIncome                    113937 non-null float64
LoanKey                                113937 non-null object
TotalProsperLoans                      22085 non-null float64
TotalProsperPaymentsBilled             22085 non-null float64
OnTimeProsperPayments                  22085 non-null float64
ProsperPaymentsLessThanOneMonthLate    22085 non-null float64
ProsperPaymentsOneMonthPlusLate        22085 non-null float64
ProsperPrincipalBorrowed               22085 non-null float64
ProsperPrincipalOutstanding            22085 non-null float64
ScorexChangeAtTimeOfListing            18928 non-null float64
LoanCurrentDaysDelinquent              113937 non-null int64
LoanFirstDefaultedCycleNumber          16952 non-null float64
LoanMonthsSinceOrigination             113937 non-null int64
LoanNumber                             113937 non-null int64
LoanOriginalAmount                     113937 non-null int64
LoanOriginationDate                    113937 non-null object
LoanOriginationQuarter                 113937 non-null object
MemberKey                              113937 non-null object
MonthlyLoanPayment                     113937 non-null float64
LP_CustomerPayments                    113937 non-null float64
LP_CustomerPrincipalPayments           113937 non-null float64
LP_InterestandFees                     113937 non-null float64
LP_ServiceFees                         113937 non-null float64
LP_CollectionFees                      113937 non-null float64
LP_GrossPrincipalLoss                  113937 non-null float64
LP_NetPrincipalLoss                    113937 non-null float64
LP_NonPrincipalRecoverypayments        113937 non-null float64
PercentFunded                          113937 non-null float64
Recommendations                        113937 non-null int64
InvestmentFromFriendsCount             113937 non-null int64
InvestmentFromFriendsAmount            113937 non-null float64
Investors                              113937 non-null int64
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [6]:
# Need to convert DF to Excel for visual assessment
from pandas import ExcelWriter

ld_300 = ld.head(300)

writer = ExcelWriter('LoanData.xlsx')
ld_300.to_excel(writer)
writer.save()

ASSESS

Exploratory Analysis of: "Loan Data from Prosper" - ld_300 table

Data Overview:

Loan Data from Prosper: This data set contains 113,937 loans with 81 variables on each loan.

After doing some programmatic assessment:
  • dtypes: bool(3), float64(50), int64(11), object(17)
  • memory usage: 68.1+ MB
  • RangeIndex: 113937 entries, 0 to 113936
  • Data columns (total 81 columns)

  • This information was very valuable for understanding which variables are "qualitative" and which are "quantitative". That inturn dictates which plot we're going to use for their exploration.

What is the structure of your dataset?

  • There're 113,937 loans with 81 variables on each loan in the dataset. There are numeric variables such as:

BorrowerAPR, MonthlyLoanPayment, EmploymentStatusDuration, DebtToIncomeRatio, LoanOriginalAmount.

  • And there're also categorical variables such as:

LoanStatus: Chargedoff, Defaulted, Past Due, Current, Completed;

IncomeRange: Not employed, Not displayed, USD 1-24,999, USD 25,000-49,999, USD 50,000-74,999, USD 75,000-99,999, USD 100,000+;

The main features of interest in my dataset are:

How the borrower's terms affect the loan terms?

What affects the borrower’s APR or interest rate?

Is there a relationship beteween employment, income, and the loan amount?

What are characteristics of the large original loan amounts?

What are risky characteristics of "bad" (delinquent) loans?

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

After doing some visual assessment:

Decided to group and specify variables that I think will help support my investigation into my features of interest:

Loan parameters

  • ListingCategory (numeric)
  • LoanStatus
  • Term
  • LoanOriginalAmount
  • BorrowerAPR
  • MonthlyLoanPayment

Borrower’s parameters

  • Occupation
  • EmploymentStatus
  • EmploymentStatusDuration

Income

  • DebtToIncomeRatio
  • IncomeRange
  • IncomeVerifiable

Univariate Exploration

Starting exploration with Borrowers parameters, qualitative variables
Define
  • Going to use Barcharts to explore qualitative variable: Occupation
Code
In [7]:
# Setting plot size
fig = plt.figure(figsize = [8,18])

# Setting color and order
base_color = sb.color_palette()[0]
oc_counts = ld['Occupation'].value_counts()
oc_order = oc_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['Occupation'].value_counts().max()
max_prop = max_count / n_ld

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap

# Labelling the bars with ratio parameters
for i in range(oc_counts.shape[0]):
    count = oc_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
    
sb.countplot(data = ld, y = 'Occupation', 
             color = base_color, order = oc_order);
After doing some exploratory analysis of Occupation:
  • Other makes the biggest category with 25.1%
  • The second largest Ocupation category is Professional with 12%
  • The rest of the occupations follow with a large gap - in 3rd place is Computer Programmer with 3.9%

Insight:

  • Most borrowers don't respond with the occupation in the list provided.
  • A large number of borrowers responded with Professional - what do they do for a living?
  • The bank doesn't give loans to students

Question: What's the true meaning of the value: Professional? Why a quarter of all borrowers indicated Other for their occupation, and does it affect loan parameters?

Define
  • Going to use Bar charts to explore qualitative variables: IsBorrowerHomeowner
Code
In [8]:
# Checking for the different values in the IsBorrowerHomeowner column
ld.IsBorrowerHomeowner.value_counts()
Out[8]:
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64
In [9]:
# Checking for null values
ld.IsBorrowerHomeowner.isnull().sum()
Out[9]:
0
After doing some programmatic assessment of IsBorrowerHomeowner:
  • Homeowner and non-homeowners evenly devided between borrowers - 1000 difference.

    Insight: Lack of homeonwership doesn't decrease chances of getting a loan

    Question: Does homeownership affect the APR or the LA (Loan Amount)?

Define
  • Going to use Bar charts to explore qualitative variables: EmploymentStatus
Code
In [10]:
# Checking for the different values in the LoanStatus column
ld.EmploymentStatus.value_counts()
Out[10]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [11]:
# Checking for null values
ld.EmploymentStatus.isnull().sum()
Out[11]:
2255
In [12]:
# Setting plot size
plt.figure(figsize = [8, 5])

# Setting color and order
base_color = sb.color_palette()[0]
es_counts = ld['EmploymentStatus'].value_counts()
es_order = es_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap

# Labelling the bars with ratio parameters
for i in range(es_counts.shape[0]):
    count = es_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
    
sb.countplot(data = ld, y = 'EmploymentStatus', 
             color = base_color, order = es_order);
After doing some exploratory analysis of EmploymentStatus:
  • 59.1% of all borrowers are Employed
  • 23.1% of all borrowers are Full-time
  • 5.4% of all borrowers are Self-employed
  • 4.7% of all borrowers are Not-Available
  • 2255 entries with Null values

    Insight: Around 4.7% of borrowers don't have a regular income.

    Question: Is there a difference in loan terms for these 4.7%? And what is the difference between "Employed" and "Full-time"?

Define
  • Going to use Bar charts to explore qualitative variables: LoanStatus
Code
In [13]:
# Checking for the different values in the LoanStatus column
ld.LoanStatus.value_counts()
Out[13]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
In [14]:
ld.LoanStatus.isnull().sum()
Out[14]:
0
In [15]:
# Setting plot size
plt.figure(figsize = [8, 5])
# Setting color and order
base_color = sb.color_palette()[0]
ls_counts = ld['LoanStatus'].value_counts()
ls_order = ls_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

# Labeling the x axis
plt.xlabel('proportion');
plt.xticks(rotation = 90); #Rotating the labels 90 deg not to overlap

# Labelling the bars with ratio parameters
for i in range(ls_counts.shape[0]):
    count = ls_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
    
sb.countplot(data = ld, y = 'LoanStatus', 
             color = base_color, order = ls_order);
In [16]:
# Combining categories of 'Past Due' values; also 'FinalPaymentInProgress' with 'Completed'
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (1-15 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (31-60 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (61-90 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (91-120 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (16-30 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('Past Due (>120 days)', 'Past Due')
ld['LoanStatus'] = ld['LoanStatus'].replace('FinalPaymentInProgress', 'Completed')
In [17]:
# Dropping the outliers 'Cancelled'
#df = df[df.line_race != 0]
ld = ld[ld.LoanStatus != 'Cancelled']
In [18]:
# Checking the code
ld.LoanStatus.value_counts()
Out[18]:
Current       56576
Completed     38279
Chargedoff    11992
Defaulted      5018
Past Due       2067
Name: LoanStatus, dtype: int64
In [19]:
# Setting plot size
plt.figure(figsize = [8, 5])
# Setting color and order
base_color = sb.color_palette()[0]
ls_counts = ld['LoanStatus'].value_counts()
ls_order = ls_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanStatus'].value_counts().max()
max_prop = max_count / n_ld

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]


# Labelling the bars with ratio parameters
for i in range(ls_counts.shape[0]):
    count = ls_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
sb.countplot(data = ld, y = 'LoanStatus', 
             color = base_color, order = ls_order);
After doing some exploratory analysis of LoanStatus :
  • Almost 50% of loans are current (paying interest)
  • 33% of all loans are repaid
  • 10.5% are written off (can't be recovered)
  • 4.5% are deliquent

    Insight: 16.7% of all loans are bad business

    Wrangling: Combined the "Past Due" loans. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Organised into ordered categorical type

    Question: Are there common characteristics among borrowers or their loan terms, that default? Are there similarities among borrowers or their loan terms, that repaid?

In [20]:
# Assess the values of 'IncomeVerifiable' column
ld.IncomeVerifiable.value_counts()
Out[20]:
True     105264
False      8668
Name: IncomeVerifiable, dtype: int64
In [21]:
# Checking for Null entries
ld.IncomeVerifiable.isnull().sum()
Out[21]:
0
After doing some exploratory analysis of IncomeVerifiable :
  • About 8% of all borrowers can't verify their income

    Insight: 8% of borrowers get loans withouth proving their income

    Question: Are there common characteristics among borrowers or their loan terms, that can't verify income?

Define
  • Going to use Barchart to explore quantitative variable: ListingCategory (numeric). Going to rename the variable to: LoanPurpose
Code
In [22]:
# Renaming the column
# df.rename(columns={"A": "a", "B": "c"})
ld.rename(columns={'ListingCategory (numeric)': 'LoanPurpose'}, inplace = True)
Test
In [23]:
# Testing code
ld.LoanPurpose[3:5]
Out[23]:
3    16
4     2
Name: LoanPurpose, dtype: int64
In [24]:
ld.LoanPurpose.value_counts()
Out[24]:
1     58307
0     16961
7     10494
2      7433
3      7189
6      2572
4      2395
13     1996
15     1522
18      885
14      876
20      771
19      768
5       756
16      304
11      217
8       199
10       91
9        85
12       59
17       52
Name: LoanPurpose, dtype: int64
In [25]:
# Setting the size of the plot
fig = plt.figure(figsize = [8,18])

# Setting color and order
base_color = sb.color_palette()[0]
lp_counts = ld['LoanPurpose'].value_counts()
lp_order = lp_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['LoanPurpose'].value_counts().max()
max_prop = max_count / n_ld

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# Labelling the bars with ratio parameters
for i in range(lp_counts.shape[0]):
    count = lp_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
    
sb.countplot(data = ld, y = 'LoanPurpose', 
             color = base_color);
After doing some exploratory analysis of LoanPurpose :
  • 51.2% of all loans have purpose: 1
  • In the second place with 14.9% of all loans have purpose: 0
  • In the third place with 9.2% purpose: 7
  • Purposes: 2 and 3 are in fourth place with 6.5% and 6.3% respectively

    Insight: Very big gap between pupose 1 and the other purposes.

    Question: What is purpose 1 and why there is such a big gap?

Define
  • Going to use Barchart to explore qualitative variable: IncomeRange
Code
In [26]:
# Exploring values of the 'IncomeRange' variable
ld.IncomeRange.value_counts()
Out[26]:
$25,000-49,999    32191
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7737
$1-24,999          7274
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64
In [27]:
# Combining "$0" and "Not empoloyed" values
ld['IncomeRange'] = ld['IncomeRange'].replace('$0', 'Not employed')
In [28]:
ld.IncomeRange.value_counts()
Out[28]:
$25,000-49,999    32191
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7737
$1-24,999          7274
Not employed       1427
Name: IncomeRange, dtype: int64
In [29]:
ld.LoanStatus.value_counts()
Out[29]:
Current       56576
Completed     38279
Chargedoff    11992
Defaulted      5018
Past Due       2067
Name: LoanStatus, dtype: int64
In [30]:
# convert LoanStatus and IncomeRange into ordered categorical types
ordinal_var_dict = {'LoanStatus': ['Chargedoff', 'Defaulted', 'Past Due', 'Current', 'Completed'],
                    'IncomeRange': ['Not employed', 'Not displayed', '$1-24,999', '$25,000-49,999', 
                                    '$50,000-74,999', '$75,000-99,999', '$100,000+']}

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    
ld[var] = ld[var].astype(ordered_var)
 
In [31]:
plt.figure(figsize = [8,5])
#ax = fig.add_axes([1, 5, 2, 10])

# Setting color and order
base_color = sb.color_palette()[0]
ir_counts = ld['IncomeRange'].value_counts()
#ir_order = ir_counts.index

# get proportion taken by most common group for derivation of tick marks
n_ld = ld.shape[0]
max_count = ld['IncomeRange'].value_counts().max()
max_prop = max_count / n_ld

# Adding the relative parameters to the plot
plt.xticks(tick_props * n_ld, tick_names)

# generate tick marks locations and names
tick_props = np.arange(0, max_prop, 0.5)
tick_names = ['{:0.2f}'.format(v) for v in tick_props]

# Labelling the bars with ratio parameters
for i in range(ir_counts.shape[0]):
    count = ir_counts[i]
    pct_string = '{:0.1f}%'.format(100*count/n_ld)
    plt.text(count+1, i, pct_string, va = 'center');
    
    
sb.countplot(data = ld, y = 'IncomeRange', 
             color = base_color);
After doing some exploratory analysis of IncomeRange :
  • Income ranges: USD 25,000-49,999 and USD 50,000-74,999 make up 28.3% and 27.3% respectively of all loans, and they make up the largest quantities
  • Income ranges: USD 100,000+ and USD 75,000-99,999 make up 15.2% and 14.8% respectively of all loans. The second largest quantities.

Insight: Income ranges: USD 100,000+ and USD 75,000-99,999 together make up 30% of all loans.

Wrangling: Convert the income ranges into categorical type. Combined '0' and 'Not employed'

Question: Any differences in the loan parameters between the lower ranges and the higher ranges? Any similarities in the loan parameters within the ranges?

Now we're going to perform 'Univariate Exploration' on the quantitative variables:

Define
  • Going to use Histograms to explore quantitative variable: Term
Code
In [32]:
# Exploring the values of the 'Term' variable
ld.Term.value_counts()
Out[32]:
36    87773
60    24545
12     1614
Name: Term, dtype: int64
In [33]:
ld.Term.isnull().sum()
Out[33]:
0
In [34]:
labels = '60 months', '36 months', '12 months'
sizes = [24545, 87778, 1614]
colors = ['gold', 'yellowgreen', 'lightcoral']

plt.title('Loan Term (months)')


# Plot
plt.pie(sizes, labels=labels, colors=colors, 
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()
After doing some exploratory analysis of Term :
  • 77% of all loans are taken/given for 3 years
  • 21,5% of all loans are taken/given for 5 years
  • Only 1.4% of all loans are taken/given for one year

    Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.

    Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?

Define
  • Going to use Histograms to explore quantitative variable: LoanOriginalAmount
Code
In [35]:
ld.LoanOriginalAmount.isnull().sum()
Out[35]:
0
In [36]:
# Plotting LoanOriginalAmount

bins = np.arange(0, ld['LoanOriginalAmount'].max()+1000, 1500)

plt.title('Original Loan Amount')
plt.xlabel('Loan Amount in USD')
plt.ylabel('Count')

plt.hist(data = ld, x = "LoanOriginalAmount", bins = bins);
In [37]:
# df.columns.get_loc("pear")
ld.columns.get_loc('LoanOriginalAmount')
Out[37]:
63
In [38]:
# Alternative slicing to avoid error, none of which work
# x[(slice(None),1)]
# data.ix[:3, :'pop']
# data.loc[:'Illinois', :'pop']
sb.distplot(ld['LoanOriginalAmount']);
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
In [39]:
# Organizing the bins for better visualisation
bin_edges = np.arange(2500, 35000+5000, 5000)
plt.hist(ld['LoanOriginalAmount'] , bins = bin_edges, rwidth = 0.7)
plt.xticks(np.arange(5000, 35000+2500, 5000))
plt.xlabel('Loan Amount ($)')
plt.ylabel('Number of Loans');
After doing some exploratory analysis of LoanOriginalAmount :
  • The graph shows that most of the loans are up to USD 5000
  • Graph also shows that most of the loans are taken/given in USD 5000 increments: USD 5K, USD 10000, USD 15000, USD 20000, USD 25000, USD 30000 and USD 35000

    Insight: Lender specializes on commercial loans with, loan values (LV): USD 5K, USD 10K, USD 20K, USD 25K, USD 30K, and USD 35K

    Question: Is there a difference or similarities in borrowers and loan terms according to LV?

Define
  • Going to use Histograms to explore quantitative variable: BorrowerAPR
Code
In [40]:
ld.BorrowerAPR.isnull().sum()
Out[40]:
25
In [41]:
# Dropping the null values
# df.dropna(subset=['name', 'born'])
ld.dropna(subset=['BorrowerAPR'], inplace=True)
In [42]:
ld.BorrowerAPR.isnull().sum()
Out[42]:
0
In [43]:
# Plotting histogram with variable 'BorrowerAPR'
# Organizing the bins for better visualisation
bin_edges = np.arange(0.05, 0.4+0.1, 0.1)
plt.hist(ld['BorrowerAPR'] , bins = bin_edges, rwidth = 0.7)
plt.xticks(np.arange(0.1, 0.4+0.05, 0.1))
plt.xlabel('Borrower APR')
plt.ylabel('Number of Loans');
After doing some exploratory analysis of BorrowerAPR :
  • The plot shows that most loans are taken/given around 0.2% rate.
  • Loans with 0.1% and with ) 0.3% APR are in second place
  • The least quantity of loans in the portfolio have 0.4% APR.

    Insight: Quantity of loans with 0.4% is considerably small compared to the other rates.

    Wrangling: Dropped null values.

    Question: Why is there such a big gap between the loans with 0.4% APR and the other rates, considering that 0.1% difference between rates is not that significant?

Define
  • Going to use Histograms to explore quantitative variable: MonthlyLoanPayment
Code
In [44]:
ld.MonthlyLoanPayment.isnull().sum()
Out[44]:
0
In [45]:
# Plotting histogram with variable 'MonthlyLoanPayment'
plt.hist(ld['MonthlyLoanPayment']);
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
In [46]:
# Organizing the bins for better visualisation
plt.figure(figsize = [10, 5])
bin_edges = np.arange(50, 1250+100, 100)
plt.xticks(np.arange(100, 1250+100, 100))
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
plt.hist(ld['MonthlyLoanPayment'] , bins = bin_edges, rwidth = 0.7);
In [47]:
# Converting the xscale and ticks for better readability 
bin_edges = 10 ** np.arange(0.8, np.log10(ld.MonthlyLoanPayment.max())+0.1, 0.05)
plt.hist(ld.MonthlyLoanPayment, bins = bin_edges)
plt.xlabel = 'Monthly payments (in USD)'
plt.ylabel = 'Number of Loans'
plt.xscale('log')
tick_locs = [10, 30, 100, 300, 500, 1000, 3000]
plt.xticks(tick_locs, tick_locs);
After doing some exploratory analysis of MonthlyLoanPayment :
  • The plot shows bimodal distribution with picks at USD 160 and USD 400 monthly payments.

    Insight: A very large amount of loans is generating over USD 400 monthly payments.

    Exploration: looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak at USD 160 and, and another at USD 400.

    Question: Do smaller payments and therefore smaller loans have less delinquent Loan Statuses?

Define
  • Going to use Histograms to explore quantitative variable: EmploymentStatusDuration
Code
In [48]:
ld.EmploymentStatusDuration.isnull().sum()
Out[48]:
7596
In [49]:
ld_em =ld.dropna(subset=['EmploymentStatusDuration'])
In [50]:
ld_em.EmploymentStatusDuration.isnull().sum()
Out[50]:
0
In [51]:
# Plotting histogram with variable 'EmploymentStatusDuration'
plt.hist(data = ld, x = 'EmploymentStatusDuration');
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/numpy/lib/histograms.py:754: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/numpy/lib/histograms.py:755: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)

The plot is scewed to the right with a long tail, so we're going to use log scale to make sense of the outliers.

In [52]:
np.log10(ld['EmploymentStatusDuration'].describe())
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: RuntimeWarning: divide by zero encountered in log10
  """Entry point for launching an IPython kernel.
Out[52]:
count    5.026578
mean     1.982598
std      1.975344
min          -inf
25%      1.414973
50%      1.826075
75%      2.136721
max      2.877947
Name: EmploymentStatusDuration, dtype: float64
In [53]:
# Plotting the variable on the log scale
plt.figure(figsize = [8, 5])
bin_edges = 10 ** np.arange(-1, 5+0.1, 0.1)
plt.hist(data = ld, x = 'EmploymentStatusDuration' , bins = bin_edges)
plt.xscale('log');

Obvious outliers on the left side of the distribution

In [54]:
# Converting the xscale and ticks for better readability 
plt.figure(figsize = [8, 5])
bin_edges = 10 ** np.arange(0.8, np.log10(ld.EmploymentStatusDuration.max())+0.1, 0.1)
plt.hist(ld.EmploymentStatusDuration, bins = bin_edges)
plt.xscale('log')

#It is important that the xticks are specified after xscale since that 
#function has its own built-in tick settings.
tick_locs = [10, 30, 100, 300, 1000, 3000]
plt.xticks(tick_locs, tick_locs);
In [55]:
# Select low outliers
low_outliers = (ld_em['EmploymentStatusDuration'] < 12)
print(low_outliers.sum())
13721
In [56]:
# Select high outliers
high_outliers = (ld_em['EmploymentStatusDuration'] > 300)
print(high_outliers.sum())
5073
After doing some exploratory analysis of EmploymentStatusDuration :
  • The plot shows that most borrowers have: 50 to 200 months or 4 to 16 years of employment history.

    Insight: Most of the borrowers have a solid employment history. 13721 loans are given to borrowers with less than 12 months of employment.

    Question: What are the loan terms for the outliers, especially for those with less than 12 months of employment history?

Define
  • Going to use Histograms to explore quantitative variable: DebtToIncomeRatio
Code
In [57]:
# Checking for nulls in DI
ld.DebtToIncomeRatio.isnull().sum()
Out[57]:
8554
In [58]:
# Checking for nulls in ld_em (dropped EMP nulls)
ld_em.DebtToIncomeRatio.isnull().sum()
Out[58]:
8424
In [59]:
# Dropping the nulls in DI
ld_em_di = ld.dropna(subset=['DebtToIncomeRatio'])
In [60]:
# Checking code
ld_em_di.DebtToIncomeRatio.isnull().sum()
Out[60]:
0
In [61]:
ld_em_di['DebtToIncomeRatio'].describe()
Out[61]:
count    105353.000000
mean          0.275998
std           0.551826
min           0.000000
25%           0.140000
50%           0.220000
75%           0.320000
max          10.010000
Name: DebtToIncomeRatio, dtype: float64

Very unusual value: max = 10; How can a ratio of Debt to Income be 10, that means the debt burden is 10 times higher than income. How is that possible?

In [62]:
# Plotting histogram with variable 'DebtToIncomeRatio'
plt.hist(data = ld_em_di, x = 'DebtToIncomeRatio')
plt.xlim(0, 3);
In [63]:
# Organizing the bins for better visualisation
plt.figure(figsize = [8, 6])
bin_edges = np.arange(0.05, 1+0.15, 0.1)
plt.hist(ld_em_di['DebtToIncomeRatio'] , bins = bin_edges, rwidth = 0.6)
plt.xticks(np.arange(0.1, 1+0.15, 0.1));
In [64]:
# Converting the xscale and ticks for better readability 
bin_edges = 10 ** np.arange(0.01, np.log10(ld.DebtToIncomeRatio.max())+0.1, 0.1)
plt.hist(ld_em_di.DebtToIncomeRatio, bins = bin_edges)
plt.xscale('log')

#It is important that the xticks are specified after xscale since that 
#function has its own built-in tick settings.
tick_locs = [0.5, 1, 2, 5, 10, 15, 20]
plt.xticks(tick_locs, tick_locs);
In [65]:
# Select risky loans
risky_ratio = (ld_em_di['DebtToIncomeRatio'] > 0.6)
print(risky_ratio.sum())
2545
In [66]:
# Select more_risky loans - more than 80% of income
more_risky = (ld_em_di['DebtToIncomeRatio'] > 0.8)
print(more_risky.sum())
1205
In [67]:
# Select very_risky loans - more than 100% of income
very_risky = (ld_em_di['DebtToIncomeRatio'] > 1)
print(very_risky.sum())
799

Loans with "DebtToIncomeRatio" more than 60% are riskier than the ones with a smaller ratio.

Insigt: It will be good to check the "LoanStatus" and other characteristics of these riskier loans

This is an area of special attention. Debt Income ratio more than "1" means that the borrower has more debts than income, and that is problematic situation!

After doing some exploratory analysis of DebtToIncomeRatio :
  • The plot shows that most loans have Debt to Income ratio between 0.1 and 0.2.

    Insight: Very unusual outliers showing a large number of loans with a very high Income to Debt ratio 1205 loans > 0.8 DI ratio, but 799 loans > 1 DI ratio

    Wrangling: Dropped 8424 null values.

    Exploring: Did log scale transformation to identify outliers

    Question: What are the loan terms for outliers with 0.8 Debt to Income ratio?

Define
  • Going to use Histograms to explore quantitative variable: StatedMonthlyIncome
Code
In [68]:
# Exploring values of 'StatedMonthlyIncome'
ld.StatedMonthlyIncome
Out[68]:
0          3083.333333
1          6125.000000
2          2083.333333
3          2875.000000
4          9583.333333
5          8333.333333
6          2083.333333
7          3355.750000
8          3333.333333
9          3333.333333
10         7500.000000
11         1666.666667
12         2416.666667
13         5833.333333
14        10833.333333
15         5500.000000
16         8291.666667
17         5833.333333
18         6250.000000
19         3075.000000
20         5166.666667
21         3750.000000
22          118.333333
23         2500.000000
24         2333.333333
25         6974.000000
26         3885.916667
27         6666.666667
28         3600.000000
29        10416.666667
              ...     
113906     5916.666667
113907     2166.666667
113908     2833.333333
113909     3333.333333
113910    18756.000000
113911     3333.333333
113912     2500.000000
113913     9750.000000
113914     2400.000000
113915     4416.666667
113916     4583.333333
113917     1916.666667
113918     5583.333333
113919     2500.000000
113920     5208.333333
113922     5250.000000
113923     3966.666667
113924     3208.333333
113925     2333.333333
113926     5416.666667
113927    10333.333333
113928     2333.333333
113929     4333.333333
113930     6250.000000
113931     8146.666667
113932     4333.333333
113933     8041.666667
113934     2875.000000
113935     3875.000000
113936     4583.333333
Name: StatedMonthlyIncome, Length: 113907, dtype: float64
In [69]:
ld.StatedMonthlyIncome.describe()
Out[69]:
count    1.139070e+05
mean     5.607104e+03
std      7.478469e+03
min      0.000000e+00
25%      3.200000e+03
50%      4.666667e+03
75%      6.818458e+03
max      1.750003e+06
Name: StatedMonthlyIncome, dtype: float64
In [70]:
# Converting floats to int
# df.col = df.col.astype(int)
ld.StatedMonthlyIncome = ld.StatedMonthlyIncome.astype(int)
In [71]:
# Testing
ld.StatedMonthlyIncome.dtype
Out[71]:
dtype('int64')
In [72]:
# plotting carat on a standard scale
binsize = 1000
bins = np.arange(0, ld['StatedMonthlyIncome'].max()+binsize, binsize)

plt.figure(figsize=[8, 5])
plt.hist(data = ld, x = 'StatedMonthlyIncome', bins = bins)
plt.xlim([0,40000])
plt.show()

Right skewed plot with a long tail.

In [73]:
# Converting the xscale and ticks for better readability 
plt.figure(figsize=[10, 7])
bin_edges = 10 ** np.arange(0.8, np.log10(ld.StatedMonthlyIncome.max())+0.1, 0.1)
plt.hist(ld.StatedMonthlyIncome, bins = bin_edges)
plt.xscale('log')

#It is important that the xticks are specified after xscale since that 
#function has its own built-in tick settings.
tick_locs = [100, 500, 2000, 5000, 20000]
plt.xticks(tick_locs, tick_locs);
After doing some exploratory analysis of StatedMonthlyIncome :
  • The plot shows normal distribution with the mean income at 5000

    Insight: USD 5000 monthly income for the majority of borrowers corresponds with the findings in the IncomeRange column where the mean annual income falls in the range of USD 25000 - USD 75000.

After doing some exploratory analysis of Term :
  • 77% of all loans are taken/given for 3 years
  • 21,5% of all loans are taken/given for 5 years
  • Only 1.4% of all loans are taken/given for one year

    Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.

    Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?

Summary: Univariate Exploration

Qualitative variables

After doing some exploratory analysis of Term :
  • 77% of all loans are taken/given for 3 years
  • 21,5% of all loans are taken/given for 5 years
  • Only 1.4% of all loans are taken/given for one year

    Insight: Lender specializes on commercial loans with 1yr., 3yr. and 5yr terms.

    Question: Any differences between 3yr. and 5yr loans? Any similarities within 3yr and 5yr loans?

After doing some exploratory analysis of IsBorrowerHomeowner:
  • Homeowner and non-homeowners evenly devided between borrowers - 1000 difference.

    Insight: Lack of homeonwership doesn't decrease chances of getting a loan

    Question: Does homeownership affect the APR or the LA (Loan Amount)?

After doing some exploratory analysis of EmploymentStatus:
  • 59.1% of all borrowers are Employed
  • 23.1% of all borrowers are Full-time
  • 5.4% of all borrowers are Self-employed
  • 4.7% of all borrowers are Not-Available
  • 2255 entries with Null values

Insight: 4.7% of borrowers don't have a regular income.

Question: Is there a difference in loan terms for these 4.7%? And what is the difference between "Employed" and "Full-time"?

After doing some exploratory analysis of IncomeVerifiable :
  • About 8% of all borrowers can't verify their income

    Insight: 8% of borrowers get loans withouth proving their income

    Question: Are there common characteristics among borrowers or their loan terms, that can't verify income?

After doing some exploratory analysis of LoanStatus :
  • Almost 50% of loans are current (paying interest)
  • 33% of all loans are repaid
  • 10.5% are written off (can't be recovered)
  • 4.5% are deliquent

    Insight: 16.7% of all loans are bad business

    Wrangling: Combined the "Past Due" loans. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Organised into ordered categorical type

    Question: Are there common characteristics among borrowers or their loan terms, that default? Are there similarities among borrowers or their loan terms, that repaid?

After doing some exploratory analysis of IncomeRange :
  • Income ranges: USD 25,000-49,999 and USD 50,000-74,999 make up 28.3% and 27.3% respectively of all loans, and they make up the largest quantities
  • Income ranges: USD 100,000+ and USD 75,000-99,999 make up 15.2% and 14.8% respectively of all loans. The second largest quantities.

Insight: Income ranges: USD 100,000+ and USD 75,000-99,999 together make up 30% of all loans.

Wrangling: Convert the income ranges into categorical type. Combined '0' and 'Not employed'

Question: Any differences in the loan parameters between the lower ranges and the higher ranges? Any similarities in the loan parameters within the ranges?

Quantitative variables

After doing some exploratory analysis of LoanOriginalAmount :
  • The graph shows that most of the loans are up to USD 5000
  • Graph also shows that most of the loans are taken/given in USD 5000 increments: USD 5K, USD 10000, USD 15000, USD 20000, USD 25000, USD 30000 and USD 35000

    Insight: Lender specializes on commercial loans with, loan amounts (LA): USD 5K, USD 10K, USD 20K, USD 25K, USD 30K, and USD 35K

    Question: What variables affect the LA?

After doing some exploratory analysis of BorrowerAPR :
  • The plot shows that most loans are taken/given around 0.2% rate.
  • Loans with 0.1% and with ) 0.3% APR are in second place
  • The least quantity of loans in the portfolio have 0.4% APR.

    Insight: Quantity of loans with 0.4% is considerably small compared to the other rates.

    Wrangling: Dropped null values.

    Question: Why is there such a big gap between the loans with 0.4% APR and the other rates, considering that 0.1% difference between rates is not that significant?

After doing some exploratory analysis of EmploymentStatusDuration :
  • The plot shows that most borrowers have: 50 to 200 months or 4 to 16 years of employment history.

    Insight: Most of the borrowers have a solid employment history. 13721 loans are given to borrowers with less than 12 months employment history.

    Question: What are the loan terms for the outliers, especially for those with less than 12 months or no employment history?

After doing some exploratory analysis of MonthlyLoanPayment:
  • The plot shows bimodal distribution with picks at USD 160 and USD 400 monthly payments.

    Insight: A very large amount of loans is generating over USD 400 mothly payments.

    Question: Do smaller payments and therefore smaller loans have less delinquent Loan Statuses?

After doing some exploratory analysis of DebtToIncomeRatio :
  • The plot shows that most loans have Debt to Income ratio between 0.1 and 0.2.

    Insight: Very unusual outlier showing a large number of loans with a very high Income to Debt ratio - close to 1

    Wrangling: Dropped 8424 null values.

    Exploring: Did log scale transformation to identify outliers. 1205 loans have 0.8 ratio or higher

    Question: What are the loan terms for outliers with 0.8 Debt to Income ratio?

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

MonthlyLoanPayment variable: When used a regular plot, there was a long tail with outliers. For further exploration log scale was applied.

EmploymentStatusDuration variable: Regular plot revealed right scew with outliers. Log scale was applied to examine the outliers closer.

DebtToIncomeRatio variable: Dropped 8424 null values. Regular plot made no sense. Had to applied log scale to examine the distribution.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

LoanStatus variable: removed the "bucket" for all "Past Due" loans. This way we have 3 delinquent values: Chargedoff, Defaulted, Past Due. Combined 'FinalPaymentInProgress' with 'Completed' loans. Dropped 'Cancelled' loans. Converted into ordered categorical type.

IncomeRange variable: Converted the income ranges into categorical type. Combined '0' and 'Not employed' values.

BorrowerAPR variable: Dropped null values.

MonthlyLoanPayment variable: Looked at the data using a log transform. Under the transformation, the data looked bimodal, with one peak at USD 160 and, and another at USD 400.

EmploymentStatusDuration variable: Log scale revealed outliers to the left of the mean and programatic analysis proved that there 13721 loans with less than 12 months employment history.

DebtToIncomeRatio variable: Log scale transformation clearly showed outliers at the upper end of the ratios. 1205 loans have 0.8 ratio or higher.

Bivariate Exploration

In this section we will investigate relationships between pairs of variables in our data that were introduced in the previous section (univariate exploration).

Categorical:

  • EmploymentStatus = ES
  • IncomeVerifiable = IV
  • LoanStatus = LS
  • IncomeRange = IR
  • IsBorrowerHomeowner = HM

Numeric:

  • LoanOriginalAmount = LA
  • BorrowerAPR = APR
  • EmploymentStatusDuration = EMP
  • MonthlyLoanPayment = MP
  • DebtToIncomeRatio = DIR
Define
  • Preparing data and variables for further exploration
Code
In [74]:
# Select high outliers
outliers = (ld_em_di['DebtToIncomeRatio'] >= 0.6)
print(outliers.sum())
2694
In [75]:
# Dropping loans where DIR > 1
ld_di = ld_em_di[ld_em_di['DebtToIncomeRatio'] < 1]
In [76]:
# Checking that the outliers where dropped
ld_di['DebtToIncomeRatio']
Out[76]:
0         0.17
1         0.18
2         0.06
3         0.15
4         0.26
5         0.36
6         0.27
7         0.24
8         0.25
9         0.25
10        0.12
11        0.27
12        0.18
13        0.09
14        0.20
15        0.49
16        0.15
17        0.12
18        0.24
19        0.41
20        0.20
21        0.09
23        0.39
24        0.16
25        0.12
26        0.11
27        0.26
28        0.12
29        0.32
30        0.11
          ... 
113905    0.35
113906    0.12
113907    0.40
113909    0.24
113910    0.18
113911    0.29
113912    0.20
113913    0.25
113914    0.05
113915    0.26
113916    0.15
113917    0.40
113918    0.39
113919    0.22
113920    0.25
113922    0.18
113923    0.57
113924    0.17
113925    0.28
113926    0.07
113927    0.06
113928    0.27
113929    0.05
113930    0.20
113931    0.28
113932    0.13
113933    0.11
113934    0.51
113935    0.48
113936    0.23
Name: DebtToIncomeRatio, Length: 104540, dtype: float64
In [77]:
# Assign ld_di DF to a new "Univariate Explored" DF - ld_u for further explorations
ld_u = ld_di
In [78]:
# Setting up the numeric variables
numeric_vars = ['LoanOriginalAmount', 'BorrowerAPR', 'EmploymentStatusDuration', 'MonthlyLoanPayment', 'DebtToIncomeRatio']
In [79]:
# correlation plot
plt.figure(figsize = [8, 5])
sb.heatmap(ld_em_di[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)
plt.show()

Negative correlation between APR and MP, because of negative correlation b/w APR and LA. Small LA has a bigger APR (b/c risk), and because of the LA size, the MP is small. The opposite is true!

The negative relationship b/w APR and LA (-0.3) doesn't translate into a negative relationship between EMP and APR (0.004), logically higher EMP should yeild lower APR.

Very strange that, there is no strong positive correlation b/w EMP and LA (0.09), Even though higher EMP should yeild bigger LA, unless when higher EMP doesn't mean higher IR (Income Range), and that's a very intersting insight.

Negative correlation b/w APR and MP (-0.23). Lower APR means bigger LA, bigger LA means bigger MP - that's why there's a negative relationship b/w APR and MP.

What influences APR?

In [80]:
# plot matrix: sample 500 loans so that plots are clearer and they render faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]

g = sb.PairGrid(data = loans_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20)
g.map_offdiag(plt.scatter);
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:1472: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)

We see an obvious correlation b/w LA and MP, the bigger the loan, the bigger the monthly payment. Other variables don't show noticeable correlation between variables.

We can also see that there is no strength in the relatioships we raised earlier. Meaning that the levels of negative and positive correlation b/w the variables displayed in the previous heatmap are not evident.

In [81]:
# plot scatter plot: sample 500 loans so that plots are clearer and they render faster
samples = np.random.choice(ld_u.shape[0], 1000, replace = False)
loans_samp = ld_u.loc[samples,:]

# Scatter plot for EMP and LA
# plt.scatter(data = df, x = 'num_var1', y = 'num_var2')
plt.scatter(data = loans_samp, x = 'EmploymentStatusDuration', y = 'BorrowerAPR');
# plt.xlabel('EmploymentStatusDuration')
# plt.ylabel('BorrowerAPR');

No relationship can be seen, this means that the correlation from the heatmap is insignificant

In [82]:
# Using seaborn reg.plot with regression line
# sb.reg plot(data = df, x = 'num_var1', y = 'num_var2')
sb.regplot(data = loans_samp, x = 'EmploymentStatusDuration', y = 'BorrowerAPR');

No obvious relationship can be seen

In [83]:
# Checking supposed negative relationship b/w APR and LA (-0.3)

# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]

sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR');

Here the negative correlation b/w the APR and the LA is presented by the regression line. The slope is not that steep and the points around it don't align to strengthen the support for this relationship.

However, another insight is coming to light. There seems to be a fixed relationship, where a paticular LA (10K, 15K, 20K...) is given out at different APRs.

Let's increase the sample.

In [84]:
# Checking supposed negative relationship b/w APR and LA (-0.3)

# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]

sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR');

Yes, we clearly see that LA = USD 25K has been approved at APR range of 0.07 to 0.34, and even though it is a pretty wide range, (considering the the full APR range examined is: 0.01 to 0.44) the ranges of APR get smaller as the LA increses and that is what is supporting the slight negative correlation b/w APR and LA.

Let's look at relationship b/w DIR and APR

In [85]:
# Checking supposed negative relationship b/w DIR and APR

# Sample 500 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 500, replace = False)
loans_samp = ld_u.loc[samples,:]
plt.xticks([0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.yticks([0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
sb.regplot(data = loans_samp, x = 'DebtToIncomeRatio', y = 'BorrowerAPR');

Shows a slight positive relationship, meaning higher DIR leads to higher APR. That can be understood in terms of risk.

Let's look at relationship b/w DIR and LA

In [86]:
# Sample 5000 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0],5000, replace = False)
loans_samp = ld_u.loc[samples,:]

# Plot DIR vs LA
sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'DebtToIncomeRatio');

Not much correlation, but there is a pattern which shows that most LAs are issued with DIR < 0.60. This can be understood because of risk. We can see a pattern that 35K LA stops at 0.4 DIR; 25K LA level has higer DIR on record (0.6); 15K LA registered (0.9) and so on. So we see a relationship: higher DIRs - lower LAs. This is logical: lendging large amounts to borrowers with high debt burdern is very risky and irresponsible.

Loans with "DebtToIncomeRatio" more than 60% are riskier than the ones with a smaller ratio.

Insigt: It will be good to check the "LoanStatus" and other characteristics of these riskier loans

This is an area of special attention. Debt Income ratio more than "1" means that the borrower has more debts than income, and that is problematic situation!

There is a slight positive correlation b/w DI and APR, not sure how significant is this level of correlation for our purposes.

Will now explore relatioship between EMP and LA (0.09)

In [87]:
# Sample 2000 loans so that plot is clearer and renders faster
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]

sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'EmploymentStatusDuration');

No evident correlation between EMP and LA

In [88]:
ld_u.EmploymentStatusDuration.median()
Out[88]:
68.0
In [89]:
ld_u.EmploymentStatusDuration.describe()
Out[89]:
count    97202.000000
mean        97.071696
std         94.249036
min          0.000000
25%         26.000000
50%         68.000000
75%        139.000000
max        755.000000
Name: EmploymentStatusDuration, dtype: float64
In [90]:
ld_u.LoanOriginalAmount.median()
Out[90]:
6500.0
In [91]:
ld_u.LoanOriginalAmount.describe()
Out[91]:
count    104540.000000
mean       8459.057366
std        6336.415483
min        1000.000000
25%        4000.000000
50%        6500.000000
75%       12000.000000
max       35000.000000
Name: LoanOriginalAmount, dtype: float64

Not much correlation b/w variables, but again we see data point fall in stairghl line according to the distinct LA values (10K, 15K, etc...) This may mean that the loans are sold and packaged at these LAs.

We have an interesting insight: Even though most of the loans are in the range 0 to 140 EMP and up to 12K, there are outliers with 35K loans at 0 to 100 EMP levels. What characteristic of a borrower made the lender ingnore little or no employment history to give maximum LA to minimum EMP?

Time to dive deeper into bivariate exploration between qualitative and quantitative variables.

It will be interesting to investigate how the parameters of:

Numeric:

  • LoanOriginalAmount = LA
  • BorrowerAPR = APR

are afected by the categories of:

Categorical:

  • EmploymentStatus = ES
  • IncomeVerifiable = IV
  • IncomeRange = IR

How:

Numeric:

  • LoanOriginalAmount = LA
  • BorrowerAPR = APR
  • EmploymentStatusDuration = EMP
  • DebtToIncomeRatio = DIR

differ according to the LS (Loan Status):

Categorical:

  • LoanStatus = LS
In [92]:
# Checking for categories
ld_u.EmploymentStatus.value_counts()
Out[92]:
Employed         65683
Full-time        25415
Not available     5136
Other             3458
Self-employed      986
Part-time          899
Retired            713
Not employed        65
Name: EmploymentStatus, dtype: int64
In [93]:
# Setting up 'EmploymentStatus' categorical variables 
empstatus_vars = ['Employed', 'Full-time', 'Not available', 
                  'Other', 'Self-employed', 'Part-time', 
                  'Retired', 'Not employed']
estatuses = pd.api.types.CategoricalDtype(ordered = True, 
                                        categories = empstatus_vars)
ld_u['EmploymentStatus'] = ld_u['EmploymentStatus'].astype(estatuses);
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
In [94]:
base_color = sb.color_palette()[0]

sb.violinplot(data = ld_u, x = 'EmploymentStatus', 
              y = 'LoanOriginalAmount', color = base_color)
plt.xticks(rotation = 20);

In it is evedent from the plot that status "Employed" get more of larger LAs compared to the other statuses.

Status "Other" displays more 10K and 15K LAs. The long tail suggest that there're outliers all the way to 35K LA. Insight: very interesting to know, what is meant by "Other"? What additional documents the bank is asking to qualify borrowers with this status for a loan.

Status "Full-time" is in the third place with fewer LAs at 10K and 15K LAs, compare to "Other", but has more loans issued with 25K LA. It is not clear what statuses "Employed", "Full-time" and "Other" really mean to the bank. Why he issues more 10K and 15K LAs to "Other" than "Full-time", but more 25K LAs to "Employed" than "Other"

However it is evident that the group with less or no employment verification yeilds smaller LAs. "Not-available", "Part-time", "Retired" and "Not employed" have most LAs at 2.5K level, with tails that stop at 15K and 25K LAs. This is considerable lower than the: "Employed", "Full-time" and "Other" group with tails reaching 35K LAs

In [95]:
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'EmploymentStatus', 
              y = 'BorrowerAPR', color = base_color)
plt.xticks(rotation = 20);

For the majority loans in the "low employment" categories ("Part-time", "Retired" and "Not employed" except for "Not-available"), the APR is lower. However, the minimum APR for this categories is higher than "Not-availabe". "Not-availabe" is capped 0.32 APR while the rest of the "low employment" cats rise beyond 0.4 APR surpassing the "high employment" cats. ("Employed", "Full-time", "Other").

"Not available" has a significant amount of loans with a high APR = 0.3%, but no outliers at the higher end, signalling that its a risk group. Similar situation with "Other" with many loans at APR = 0.35% and a very short bottom tail, meaning no low APRs for this category - risk group. However, it is not clear why this logic doesn't spread to the "low employment" cats.

Insight: Why do we have loans with APR = 0 or close to 0?

In [96]:
# Identify the minimum value
ld_u['BorrowerAPR'].min()
Out[96]:
0.006529999999999999
In [97]:
# Select low outliers and count them
out_apr = (ld_u['BorrowerAPR'] <= 0.1)
print(out_apr.sum())
7080

Why is the bank giving out loans at almost 0% interest?

We can conclude that Employment Status of a borrower affects the Loan Amount and the APR that is given by the bank
In [98]:
# Setting up the categoric variables
cat_vars = ['IncomeVerifiable', 'IncomeRange']
In [99]:
# Renaming the categories for visualisation
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$1-24,999', '$1-24.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('Not displayed', 'ND')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$75,000-99,999', '$75K-99.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$100,000+', '$100K+')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$50,000-74,999', '$50K-74.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('$25,000-49,999', '$25K-49.9K')
ld_u['IncomeRange'] = ld_u['IncomeRange'].replace('Not employed', 'NE')
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [100]:
# Setting up 'IncomeRange' categorical variables 
incrange_vars = ['NE', 'ND', '$1-24.9K', '$25K-49.9K', 
                  '$50K-74.9K', '$75K-99.9K', '$100K+']
incranges = pd.api.types.CategoricalDtype(ordered = True, 
                                        categories = incrange_vars)
ld_u['IncomeRange'] = ld_u['IncomeRange'].astype(incranges);
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
In [101]:
ld_u.IncomeRange.value_counts()
Out[101]:
$25K-49.9K    29825
$50K-74.9K    29346
$100K+        16066
$75K-99.9K    16012
ND             7436
$1-24.9K       5805
NE               50
Name: IncomeRange, dtype: int64
In [102]:
# plot matrix of numeric features against categorical features.
# can use a larger sample since there are fewer plots and they're simpler in nature.

samples = np.random.choice(ld_u.shape[0], 20000, replace = False)
ldu_samp = ld_u.loc[samples,:]

def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x, y, color = default_color)

plt.figure(figsize = [8, 8])
g = sb.PairGrid(data = ldu_samp, y_vars = ['LoanOriginalAmount', 'BorrowerAPR'], x_vars = cat_vars,
                size = 4, aspect = 1.5)
g.map(boxgrid)
plt.show();
/Users/SeniKamara/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:1241: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(UserWarning(msg))
<Figure size 576x576 with 0 Axes>
In [103]:
ld.IncomeRange.value_counts()
Out[103]:
$25,000-49,999    32191
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7712
$1-24,999          7274
Not employed       1427
Name: IncomeRange, dtype: int64

IV and LA - from the plot it is obvious that verifiable provides for bigger LA, especially if we consider the level of outliers. This is logical, with proof of income, the lender is willing to lend more money.

IV and APR - the plot shows that with verifiable income the APR range is larger, as well as the IQR. This is very interesting, because higher risk (no verifiable income) should be reflected in higher APR. But very high APR on the risky group (no verifiable income) may raise the risk of default, so that maybe the reason for a smaller range in this group. Lender on one hand doesn't give low APR to this group on the other hand doesn't lend with high APRs either. According to the plot, "no income verification" status, raises the minimum APR and lowers the maximum APR for the borrower.

IR and LA - Interesting insight, ND (Not displayed) group has very large outliers (25K), reaching the levels for which qualifies 99K IR. What is the true meaning of ND status?

Some outliers in the "1-25K" group are also large (25K). But the main mass - in the IQR is up to 6K. "1-25K" IR has overall slightly lower LAs compared to "ND" IR. The "NE" (Not Employed) group has the lowest LAs, and that is understandable. The rest of the chart is self explanatory: The higher the IR, the higher the LA.

IR and APR - Interesting insight, minimum levels of APR that are available for "ND" and "25K-49.9K" IR are not availabe for the hiher IRs.

Insight: How does the lender justify higher APRs to the group with "verifiable income" (IV) and high IR?

I will have to segment the dataset according the LS and see if there is any correlation with respect to variables.

How:

Numeric:

  • LoanOriginalAmount = LA
  • BorrowerAPR = APR
  • EmploymentStatusDuration = EMP
  • DebtToIncomeRatio = DI

differ according to the LS (Loan Status):

Categorical:

  • LoanStatus = LS
In [104]:
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'LoanStatus', 
              y = 'LoanOriginalAmount', color = base_color)
plt.xticks(rotation = 20);

Insight: "Defaulted" and "Chargedoff" LSs are almost identical in shape and parameters compared to the other LSs! This leads to assumption that these groups of borrowers possess similar characteristics.

In [105]:
base_color = sb.color_palette()[0]
sb.violinplot(data = ld_u, x = 'LoanStatus', 
              y = 'BorrowerAPR', color = base_color)
plt.xticks(rotation = 20);

Insight: "Current" and "Past Due" LSs don't have very high APR compared to "Defaulted" and "Chargedoff".

In [106]:
plt.figure(figsize = [12, 5])
base_color = sb.color_palette()[0]

# left plot: violin plot
plt.subplot(1, 2, 1)
ax1 = sb.violinplot(data = ld_u, x = 'LoanStatus', y = 'EmploymentStatusDuration', color = base_color)

# right plot: box plot
plt.subplot(1, 2, 2)
sb.violinplot(data = ld_u, x = 'LoanStatus', y = 'DebtToIncomeRatio', color = base_color);
#plt.ylim(ax1.get_ylim()) # set y-axis limits to be same as left plot
In [107]:
ld_u['DebtToIncomeRatio'].max()
Out[107]:
0.99

Insight: it is very surprising that all categories of LS have loans that have DIs at maximum level of 0.99

Bivariate exploration between categorical variables

In [108]:
ld_u.EmploymentStatus.value_counts()
Out[108]:
Employed         65683
Full-time        25415
Not available     5136
Other             3458
Self-employed      986
Part-time          899
Retired            713
Not employed        65
Name: EmploymentStatus, dtype: int64
In [109]:
ld_esx = ld_u.loc[(ld_u['EmploymentStatus']=='Employed') | 
                   (ld_u['EmploymentStatus']=='Full-time') | 
                   (ld_u['EmploymentStatus']=='Not available')]
ld_esx.head()
Out[109]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
5 0F05359734824199381F61D 1074836 2013-12-14 08:26:37.093000000 NaN 60 Current NaN 0.15425 0.1314 0.1214 ... -25.33 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

5 rows × 81 columns

In [110]:
ld_esx.EmploymentStatus.value_counts()
Out[110]:
Employed         65683
Full-time        25415
Not available     5136
Not employed         0
Retired              0
Part-time            0
Self-employed        0
Other                0
Name: EmploymentStatus, dtype: int64
Exploring higher employment status
In [112]:
#sb.countplot(data = ld_empx, x = 'EmploymentStatus', hue = 'IncomeRange');
ax = sb.countplot(data = ld_esx, x = 'EmploymentStatus', hue = 'IncomeRange')
ax.legend(loc = 7, ncol = 1, framealpha = 1, title = 'IncomeRange');
plt.xticks(rotation = 30);

The plot clearly shows that "Employed" status has more borrowers than "Full-time" with the majority in 50K-74.9K IR and 25K-49.9K IR in the second place. For "Full-time" borrowers the majority is in 25K-49.9K IR and 50K-74.9K IR in the second place.

In [113]:
# Analizing the other ESs

ld_esxy = ld_u.loc[(ld_u['EmploymentStatus']=='Not employed') | 
                    (ld_u['EmploymentStatus']=='Retired') | 
                    (ld_u['EmploymentStatus']=='Part-time') | 
                    (ld_u['EmploymentStatus']=='Self-employed') | 
                    (ld_u['EmploymentStatus']=='Other')]
ld_esxy.head()
Out[113]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.00 0.00 0.0 1.0 0 0 0.0 258
15 0F483544120452347F48121 577164 2012-04-10 09:14:46.297000000 NaN 36 Defaulted 2012-12-19 00:00:00 0.35797 0.3177 0.3077 ... -9.81 0.0 3790.25 3790.25 0.0 1.0 0 0 0.0 10
31 0F683558129866945165D36 643927 2012-09-21 13:37:43.210000000 NaN 36 Current NaN 0.35797 0.3177 0.3077 ... -23.95 0.0 0.00 0.00 0.0 1.0 0 0 0.0 24
34 0F6C3603015887476F3F015 1180690 2014-02-24 12:55:20.430000000 NaN 36 Current NaN 0.22966 0.1920 0.1820 ... 0.00 0.0 0.00 0.00 0.0 1.0 0 0 0.0 1
38 109A35861259973172F6548 869272 2013-08-13 16:11:17.173000000 NaN 36 Current NaN 0.08930 0.0759 0.0659 ... -33.08 0.0 0.00 0.00 0.0 1.0 0 0 0.0 151

5 rows × 81 columns

Exploring lower employment status
In [114]:
#sb.countplot(data = ld_empx, x = 'EmploymentStatus', hue = 'IncomeRange');
ax = sb.countplot(data = ld_esxy, x = 'EmploymentStatus', hue = 'IncomeRange')
ax.legend(loc = 6, ncol = 1, framealpha = 1, title = 'IncomeRange');
plt.xticks(rotation = 30);

Status "Other" is showing the largest number of borrowers compared to the other ESs, with its primary IR of 25K-49.9K, which is almost twice bigger than the next 2 IRs, 1K-24.9K and 50K-74.9K. 25K-49.9K is also leading in the "Self-employed" ES, and 1K-24.9K is the main IR in the "Part-time" ES.

In [115]:
ld_u.IncomeVerifiable.value_counts()
Out[115]:
True     104447
False        93
Name: IncomeVerifiable, dtype: int64

Interesting group "False" to explore

In [116]:
# Exploring with faceting
bins = np.arange(0, 0.4+0.1, 0.1 )
g = sb.FacetGrid(data = ld_u, col = 'IncomeRange', col_wrap = 4, sharey = False)
g.map(plt.hist, 'BorrowerAPR')
Out[116]:
<seaborn.axisgrid.FacetGrid at 0x1a26d42dd8>
In [117]:
# Exploring with faceting
g = sb.FacetGrid(data = ld_u, col_wrap = 4, col = 'IncomeRange')
g.map(plt.hist, 'LoanOriginalAmount');

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Surprisingly EMP (Length of employment of a borrower) has no effect on the APR or the LA value of the loan.

Exploration showed that: ES (Employment Status) of a borrower affects the Loan Amount (LA). Higher ES qualifies for higher LAs. The opposite is true.

Anylysis revealed that ES has a differentiated relationship with APR.

For the majority loans in the "low employment" categories: "Part-time", "Retired" and "Not employed" (except for "Not-available"), the APR is lower than the "higher employment" group: "Employed", "Full-time", "Other".

However, the minimum APR for the "lower" categories is higher than "Not-availabe". "Not-availabe" is capped at 0.32 APR, while the rest of the "low employment" cats rise beyond 0.4 APR, surpassing the "high employment" cats ("Employed", "Full-time", "Other").

"Not available" has a significant amount of loans with a high APR = 0.3%, but no outliers at the higher end, signalling that its a risk group. Similar situation with "Other" with many loans at APR = 0.35% and a very short bottom tail, meaning no low APRs for this category - risk group. However, it is not clear why this logic doesn't spread to the "low employment" cats.

Exploration showed that borrowers with higher DIR (Debt to Income Ratio) generally get higher APRs.

Analysis revealed that larger LAs generally have smaller DIRs. This can be understood because of risk.

Our analysis revealed that IV (Income Verification) status has a positive relationship with LA. This is logical, with proof of income, the lender is willing to lend more money.

Exploring relationship between IV and APR we discovered that with verifiable income (VI) the APR range is larger, as well as the IQR, compared to no IV.

This is very interesting, because higher risk (no verifiable income) should be reflected in higher APRs (interest rates). But very high APR on the risky group (no verifiable income) may raise the risk of default, so that maybe the reason for a smaller range in this group. Lender on one hand doesn't give low APR to this group on the other hand doesn't lend with high APRs either. According to the plot, "no income verification" status, raises the minimum APR and lowers the maximum APR for the borrower.

Our exploratory analysis showed that IR and LA have a positive relatioship: borrowers from higher income range (IR) quaify for higher loan amounts (LAs).

According to our analysis there is no evident reationship between IR and APR. This means that borrower's income range (IR) has no effect on the APR (interest rate) he receives.

Eploration of ES (employment status) and IR (income range) relatioship provided understanding of the ranking of the ES according to the income size: Employed, Full-time, Other, Not Available, Self-employed, Part-time, Retired, Not-emploeyed

We also observed that larger LAs (loan amounts) generally have smaller APRs.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It is not clear what statuses "Employed", "Full-time" and "Other" really mean to the bank. Why he issues more 10K and 15K LAs to "Other" than "Full-time", but more 25K LAs to "Employed" than "Other"

Even within the "higher employment" group it is not clear why "Full-time" has a lower APR than "Employed", considering the fact that "Employed" displayed a higher IR than "Full-time"

Interesting insight, ND (Not displayed - low income range group) group has very large outliers with LAs (loan amounts - 25K). These are levels for which 99K IR (high income range group) qualifies. What is the true meaning of ND status?

Interesting insight, minimum levels of APR that are available for "ND" (Not Dispalyed - income range group) and "25K-49.9K" IR, are not availabe for the higher IRs.

Insight: How does the lender justify higher APRs to the group with "verifiable income" (IV) and high IR "income range"?

Multivariate Exploration

During Bivariate Exploration we have identified borrower's parameters (variables) that affect loan parameters: Loan Amount (LA) and APR (Annual Percentage Rate)

  • ES - Employment Status

  • DIR - Debt to Income Ratio

  • IR - Income Range

  • HOW - Home ownership

We've also raised a lot of questions during our bivariate exploration, all of which we will not attempt to answer within the scope of this project.

Our further Multivariate Exploration we will focus on visualizing the relatioship between borrower's parameters - ES, DIR, IR; and loan parameters - LA, APR, silmontaneously, to identify trends and insights.

We would also explore how the three categorical measures of LS (Loan Status) play into the relationship between loan parameters LA and APR.

We will also introduce a new borrower's parameter, variable HOW (IsBorrowerHomeowner) to our exploration. To see if affects loan parameters LA and APR.

How ES plays into the relationship between: LA and APR
In [118]:
ld_u.EmploymentStatus.value_counts()
Out[118]:
Employed         65683
Full-time        25415
Not available     5136
Other             3458
Self-employed      986
Part-time          899
Retired            713
Not employed        65
Name: EmploymentStatus, dtype: int64
In [119]:
samples = np.random.choice(ld_u.shape[0], 2000, replace = False)
loans_samp = ld_u.loc[samples,:]


g = sb.FacetGrid(data = loans_samp, hue = 'EmploymentStatus',
                hue_order = ['Employed', 'Full-time'], 
                 height = 6, aspect = 2)
g = g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter = 0.04, fit_reg = False);
g.add_legend();

It is evident that "Employed" status receives bigger loans at lower interest compared to "Full-time" status.

How IR plays into the relationship between: LA and APR
In [120]:
ld_u.IncomeRange.value_counts()
Out[120]:
$25K-49.9K    29825
$50K-74.9K    29346
$100K+        16066
$75K-99.9K    16012
ND             7436
$1-24.9K       5805
NE               50
Name: IncomeRange, dtype: int64
In [121]:
samples = np.random.choice(ld_u.shape[0], 1000, replace = False)
loans_samp = ld_u.loc[samples,:]

g = sb.FacetGrid(data = loans_samp, 
                 hue = 'IncomeRange', 
                 height = 4, aspect = 1.5, 
                 palette = 'viridis_r') #palette instead of hue_order
g = g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', x_jitter = 0.04, fit_reg = False);
g.add_legend();

It is clearly seen that "darker", 100K+ IR qulify for 15K LA onwards. And "lighter", lower IR qualify for smaller loan amounts. We also see that larger loans for high IR come with a lower APR than the lower ones.

How DIR plays into the relationship between: LA and APR
In [122]:
samples = np.random.choice(ld_u.shape[0],700, replace = False)
loans_samp = ld_u.loc[samples,:]

plt.figure(figsize = [10, 8])


sb.regplot(data = loans_samp, x = 'LoanOriginalAmount', y = 'BorrowerAPR', 
           x_jitter = 0.1, fit_reg = False, 
           scatter_kws = {'s' : loans_samp['DebtToIncomeRatio']*300});

#    plt.xlabel('Loan Amount in USD')
#    plt.ylabel('Debt to Income Ratio')
#    plt.legend(['Completed', 'Chargedoff'])

There is no visible trend or relationship on the plot above

Home ownership (HOW) vs Loan parameters (LA and APR)
In [123]:
# Checking for the different values in the IsBorrowerHomeowner column
ld.IsBorrowerHomeowner.value_counts()
Out[123]:
True     57476
False    56431
Name: IsBorrowerHomeowner, dtype: int64
In [124]:
samples = np.random.choice(ld_u.shape[0], 5000, replace = False)
loans_samp = ld_u.loc[samples,:]

g = sb.FacetGrid(data = loans_samp, col = 'LoanStatus', row = 'IsBorrowerHomeowner',
                margin_titles = True)
g.map(plt.scatter, 'LoanOriginalAmount', 'BorrowerAPR');

"Current" LS shows that now homeowners are priveleged with higher LAs in comparison to "non-homeowners". "Completed" LS displays a similar relationship as above, but at lower LAs. "Chargedoff" is only slightly in favor of homeowners with respect to LAs. "Defaulted" LS is showing "non-homeowners" with LAs that are clustered around small values as opposed the homeowners where the defaults are spread accross smaller and larger LAs.

In [125]:
ld_u.IncomeRange.value_counts()
Out[125]:
$25K-49.9K    29825
$50K-74.9K    29346
$100K+        16066
$75K-99.9K    16012
ND             7436
$1-24.9K       5805
NE               50
Name: IncomeRange, dtype: int64
How LS plays into the relationship between: LA and APR
In [126]:
ld_u.LoanStatus.value_counts()
Out[126]:
Current       52323
Completed     35151
Chargedoff    10568
Defaulted      4657
Past Due       1841
Name: LoanStatus, dtype: int64

"Current" vs "Chargedoff"

In [127]:
samples = np.random.choice(ld_u.shape[0],5000, replace = False)
loans_samp = ld_u.loc[samples,:]


ttype_markers = [['Current', 'o'], 
                ['Chargedoff', '^']]
plt.figure(figsize = [10, 8])

for ttype, marker in ttype_markers:
    plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
    sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR', 
              x_jitter = 0.04, fit_reg = False, marker = marker);
#    plt.xlabel('Loan Amount in USD')
#    plt.ylabel('Debt to Income Ratio')
#    plt.legend(['Completed', 'Chargedoff'])

We can see the how the lender with the "Current" loans have used past experience and adjusted his lending strategy from "small" LAs < 1,5K to larger LAs > K2,5 LAs, and from high APRs > 0.36 to > 0.36 larger sums at smaller APRs.

"Defaulted" vs "Current"

In [128]:
ttype_markers = [['Current', 'o'], 
                ['Defaulted', '^']]
plt.figure(figsize = [10, 8])

for ttype, marker in ttype_markers:
    plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
    sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR', 
              x_jitter = 0.04, fit_reg = False, marker = marker);
#    plt.xlabel('Loan Amount in USD')
#    plt.ylabel('Debt to Income Ratio')
#    plt.legend(['Completed', 'Chargedoff'])

Same shift away from smaller loans and high rates is shown in the graph above.

"Past Due" vs "Current"

In [129]:
ttype_markers = [['Current', 'o'], 
                ['Past Due', '^']]
plt.figure(figsize = [10, 8])

for ttype, marker in ttype_markers:
    plot_data = loans_samp.loc[loans_samp['LoanStatus'] == ttype]
    sb.regplot(data = plot_data, x = 'LoanOriginalAmount', y = 'BorrowerAPR', 
              x_jitter = 0.04, fit_reg = False, marker = marker);
#    plt.xlabel('Loan Amount in USD')
#    plt.ylabel('Debt to Income Ratio')
#    plt.legend(['Completed', 'Chargedoff'])

The lender's shift towards larger loans: 20K to 35K, for lower rates has beared fruit with little or no delinquencies within the lower LAs, where most bad loans registered in the past. The strategy also paid of at the higher end with no delinquencies with the larges loans.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

How ES plays into the relationship between: LA and APR

Multivariate Exploration supported the findings we got in Bivariate Exploration: the "Employed" status receives bigger loans at lower interest rates compared to "Full-time" and "Not available" employment statuses.

  • We can conclude that Employment Status (ES) affects parameters of the loan (LA and APR)
How IR plays into the relationship between: LA and APR

In this exploration we again observed the relationship we found in Bivariate Exploration between IR, LA and APR. Similar to ES, high IRs receive higher LAs and lower APRs.

  • Here we can conclude that IR influences loan parameters similarly to ES
Home ownership (HOW) vs Loan parameters (LA and APR)

This analysis showed us that currently, the bank favors homeowners with larger loans. Homeownership didn't reveal differences in parameters for "bad" loans. However, "Defaulted" LSs (which are more current in comparison to the "Chargedoff" LSs) showed the non-homeowners with mainly smaller LAs compared to homeowners. These observations perfectly play into the new strategy that the lender has adopted, which we will explain with the next exploration results.

  • We can conclude that homeownership affects the Loan Amount. However we didn't observe relationship with APR

Were there any interesting or surprising interactions between features?

How LS plays into the relationship between: LA and APR
  • "Current" vs "Chargedoff"

Exploration showed how with the "Current" loans the lender has used past experience and adjusted his lending strategy from "small" LAs < 1,5K to larger LAs > 2,5K LAs, and from high > 0.36 to > 0.36 APRs. Tactics: Larger sums at smaller APRs.

  • "Defaulted" vs "Current"

Analysis showed the same shift away from smaller loans and high rates as above. This makes sense, since "Defaulted" is a more current LS than "Chalgedoff", reflecting the current risk management that we see in "Current" LSs.

  • "Past Due" vs "Current"

Here the exploration displayed how the lender's shift towards larger loans: 20K to 35K, for lower rates has beared fruit with little or no delinquencies within the lower LAs, where most bad loans registered in the past. The strategy also paid off at the higher end with no delinquencies with the largest LAs.

  • We can conclude from these results that the lender is very thorough in analysing his lending products, and how the borrowers parameters fit into them.
In [130]:
ld_u.shape
Out[130]:
(104540, 81)